An Augmented PAC Model for Semi-Supervised Learning

Authors

  • Maria-Florina Balcan
  • Avrim Blum
Abstract

The standard PAC-learning model has proven to be a useful theoretical framework for thinking about the problem of supervised learning. However, it does not tend to capture the assumptions underlying many semi-supervised learning methods. In this chapter we describe an augmented version of the PAC model, designed with semi-supervised learning in mind, that can be used to help think about the problem of learning from labeled and unlabeled data and about many of the different approaches taken. The model provides a unified framework for analyzing when and why unlabeled data can help, in which one can discuss both sample-complexity and algorithmic issues. Our model can be viewed as an extension of the standard PAC model in which, in addition to a concept class C, one also proposes a compatibility function: a type of compatibility that one believes the target concept should have with the underlying distribution of data. For example, one might believe that the target should cut through a low-density region of space, or that it should be self-consistent in some way, as in co-training. This belief is then explicitly represented in the model. Unlabeled data is potentially helpful in this setting because it allows one to estimate compatibility over the space of hypotheses, and to reduce the size of the search space from the whole set of hypotheses C down to those that, according to one's assumptions, are a priori reasonable with respect to the distribution. After proposing the model, we analyze sample-complexity issues in this setting: that is, how much of each type of data one should expect to need in order to learn well, and what basic quantities these numbers depend on. We provide examples of sample-complexity bounds both for uniform convergence and ε-cover based algorithms, as well as several algorithmic results.
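As a rough illustration of the idea in the abstract, the sketch below uses unlabeled data to estimate a compatibility score for each hypothesis, discards the incompatible ones, and only then uses labeled data to choose among the survivors. It is a toy example under stated assumptions, not the authors' algorithm: the hypothesis class (one-dimensional thresholds), the margin-based notion of incompatibility, and all names (`unlabeled_incompatibility`, `labeled_error`, `semi_supervised_threshold`, `margin`, `tau`) are hypothetical choices made for illustration.

```python
import random


def unlabeled_incompatibility(threshold, unlabeled, margin):
    """Estimated incompatibility: fraction of unlabeled points that lie
    within `margin` of the threshold (assumed low-density criterion)."""
    near = sum(1 for x in unlabeled if abs(x - threshold) < margin)
    return near / len(unlabeled)


def labeled_error(threshold, labeled):
    """Empirical error of h(x) = 1 if x >= threshold else 0."""
    wrong = sum(1 for x, y in labeled if (x >= threshold) != (y == 1))
    return wrong / len(labeled)


def semi_supervised_threshold(candidates, unlabeled, labeled, margin, tau):
    # Step 1: use unlabeled data to shrink the hypothesis space to the
    # thresholds whose estimated incompatibility is at most tau.
    compatible = [t for t in candidates
                  if unlabeled_incompatibility(t, unlabeled, margin) <= tau]
    # Step 2: use labeled data to pick among the remaining hypotheses.
    best = min(compatible, key=lambda t: labeled_error(t, labeled))
    return best, compatible


# Toy data (assumed): two well-separated clusters on the real line.
rng = random.Random(0)
unlabeled = ([rng.uniform(0.0, 1.0) for _ in range(200)]
             + [rng.uniform(2.0, 3.0) for _ in range(200)])
candidates = [i / 10 for i in range(31)]   # thresholds 0.0, 0.1, ..., 3.0
labeled = [(0.5, 0), (2.5, 1)]             # one labeled point per cluster

best, compatible = semi_supervised_threshold(
    candidates, unlabeled, labeled, margin=0.2, tau=0.0)
print(len(candidates), "candidates reduced to", len(compatible))
print("chosen threshold:", best)
```

In this toy setting only thresholds that fall in the gap between the two clusters survive the compatibility filter, so two labeled examples are enough to select a consistent hypothesis from the reduced set.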

Related resources

Open Problems in Efficient Semi-supervised PAC Learning

The standard PAC model focuses on learning a class of functions from labeled examples, where the two critical resources are the number of examples needed and running time. In many natural learning problems, however, unlabeled data can be obtained much more cheaply than labeled data. This has motivated the notion of semi-supervised learning, in which algorithms attempt to use this cheap unlabele...

Semi-Supervised Learning on an Augmented Graph with Class Labels

In this paper, we propose a novel graph-based method for semi-supervised learning. Our method runs a diffusion-based affinity learning algorithm on an augmented graph consisting of not only the nodes of labeled and unlabeled data but also artificial nodes representing class labels. The learned affinities between unlabeled data and class labels are used for classification. Our method achieves su...

Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning

We study the potential benefits of unlabeled data to classification prediction to the learner. We compare learning in the semi-supervised model to the standard, supervised PAC (distribution free) model, considering both the realizable and the unrealizable (agnostic) settings. Roughly speaking, our conclusion is that access to unlabeled samples cannot provide sample size guarantees that are bett...

A Semi-Supervised Clustering Method Based on Graph Contraction and Spectral Graph Theory

Semi-supervised learning is a machine learning framework where learning from data is conducted by utilizing a small amount of labeled data as well as a large amount of unlabeled data (Chapelle et al., 2006). It has been intensively studied in data mining and machine learning communities recently. One of the reasons is that, it can alleviate the time-consuming effort to collect “ground truth” la...

Augmented hashing for semi-supervised scenarios

Hashing methods for fast approximate nearest-neighbor search are getting more and more attention with the excessive growth of the available data today. Embedding the points into the Hamming space is an important question of the hashing process. Analogously to machine learning there exist unsupervised, supervised and semi-supervised hashing methods. In this paper we propose a generic procedure t...

Publication date: 2005